Description

The present invention concerns the coding of moving images in which
a human face is represented. It is concerned to achieve low
transmission rates by concentrating on movements associated with
speech. The invention also permits the synthesis of such images to
accompany real or synthetic speech.

It has already been proposed (see BELL LABORATORIES RECORD, vol. 48,
no. 4, April 1970, pages 110-115, Murray Hill, US; F.W. MOUNTS:
"Conditional replenishment: a promising technique for video
transmission") to reduce the required transmission rate for a moving
image by comparing successive frames of the image and transmitting
data only in respect of those parts of the frame which have changed
since the previous frame. The present invention aims to take
advantage of the knowledge that, in transmitting an image of a face,
the main information content lies in movements of the mouth.

According to a first aspect of the invention there is provided an
apparatus for encoding a moving image including a human face
comprising:

means for receiving video input data;

means for output of data representing one frame of the image;

identification means arranged in operation for each frame of the
image to identify that part of the input data corresponding to the
mouth of the face represented and

(a) in a first phase of operation to compare the mouth data parts of
each frame with those of other frames to select a representative set
of mouth data parts, to store the representative set and to output
this set;

(b) in a second phase to compare the mouth data part of each frame
with those of the stored set and to generate a codeword to be output
indicating which member of the set the mouth data part of that frame
most closely resembles.

It will be appreciated that this procedure makes use of prior
knowledge as to the nature of the image by identifying specifically
the mouth of the face represented, and further takes advantage of
the fact that the mouth can be adequately represented by a selected
representative set of mouth data parts.

According to a second aspect of the invention there is provided a
speech synthesiser including means for synthesis of a moving image
including a human face, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
(Fig. 3) each corresponding to the mouth area of the face and
representing a respective different mouth shape;

(c) an input for receiving codes identifying words or parts of words
to be spoken;

(d) speech synthesis means responsive to the codes received at the
said input to synthesise words or parts of words corresponding
thereto;

(e) means storing a table relating such codes to codewords
identifying said mouth data blocks or sequences of such codewords;
and

(f) control means responsive to the codes received at the said input
to select the corresponding codeword or codeword sequence from the
table and to output it in synchronism with synthesis of the
corresponding word or part of a word by the speech synthesis means.

According to a third aspect of the invention there is provided an
apparatus for synthesis of a moving image, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks each
corresponding to the mouth area of the face and representing a
respective different mouth shape;

(c) an audio input for receiving speech signals and frequency
analysis means responsive to such signals to produce sequences of
spectral parameters;

(d) means storing a table relating spectral parameter sequences to
codewords identifying mouth data blocks or sequences thereof;

(e) control means responsive to the said spectral parameters to
select for output the corresponding codewords or codeword sequences
from the table.

Some embodiments of the invention will now be described, by way of
example, with reference to the accompanying drawings, in which:

Figure 1 is a block diagram of an image transmission system
including an encoder and receiver according to embodiments of the
invention;
Figure 2 illustrates an image to be transmitted;
Figure 3 illustrates a set of mouth shapes;
Figures 4, 5 and 6 illustrate masking windows used in face, eyes and
mouth identification;
Figure 7 is a histogram obtained using the mask of fig 6;
Figures 8 and 9 illustrate binary images of the mouth area of an
image;
Figures 10 and 11 are plan and elevational views of a head to
illustrate the effects of changes in orientation;
Figure 12 illustrates apparatus for speech analysis; and
Figure 13 is a block diagram of a receiver embodying the invention.

Figure 1 illustrates an image transmission system with a
transmitter 1, transmission link 2 and receiver 3. The techniques
employed are equally applicable to recording and the transmission
link 2 could thus be replaced by a tape recorder or other means such
as a semiconductor store.

The transmitter 1 receives an input video signal from a source such
as a camera.

The moving image to be transmitted is the face 5 (fig 2) of a
speaker whose speech is also transmitted over the link 2 to the
receiver. During normal speech there is relatively little change in
most of the area of the face - i.e. other than the mouth area
indicated by the box 6 in fig 2. Therefore only one image of the
face is transmitted. Moreover, it is found that changes in the mouth
positions during speech can be realistically represented using a
relatively small number of different mouth positions selected as
typical. Thus a code-book of mouth positions is generated, and, once
this has been transmitted to the receiver, the only further
information that needs to be sent is a sequence of codewords
identifying the successive mouth positions to be displayed.

The system described is a knowledge based system - i.e. the
receiver, following a "learning" phase, is assumed to "know" the
speaker's face and the set of mouth positions. The operation of the
receiver is straightforward and involves, in the learning phase,
entry of the face image into a frame store (from which an output
video signal is generated by repetitive readout) and entry of the
set of mouth positions into a further "mouth" store, and, in the
transmission phase, using each received codeword to retrieve the
appropriate mouth image data and overwrite the corresponding area of
the image store.
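
The receiver's transmission-phase step can be sketched as follows.
This is only an illustration under stated assumptions: images are
held as lists of rows of pel values, and the names apply_codeword,
mouth_store, box_x and box_y are hypothetical, not taken from the
specification.

```python
def apply_codeword(frame, mouth_store, codeword, box_x, box_y):
    """Overwrite the mouth box of the stored face with one codebook entry."""
    mouth = mouth_store[codeword]
    for dy, row in enumerate(mouth):
        # Replace only the pels inside the mouth box on this line.
        frame[box_y + dy][box_x:box_x + len(row)] = row
    return frame


# Tiny 4-line x 6-pel "face" and two 2 x 3 "mouths" (values are made up).
face = [[0] * 6 for _ in range(4)]
mouths = {0: [[1, 1, 1], [1, 1, 1]], 1: [[2, 2, 2], [2, 2, 2]]}
apply_codeword(face, mouths, 1, box_x=2, box_y=1)
```

Only the box area changes; the rest of the frame store keeps the
face image entered during the learning phase.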

Transmitter operation is necessarily more complex and here the
learning phase requires a training sequence from the speaker, as
follows:

1) The first frame is stored and transmitted, suitably encoded (eg
using conventional redundancy reduction techniques) to the receiver.

2) The stored image is analysed in order to (a) identify the head of
the speaker (so that the head in future frames may be tracked
despite head movements), and (b) identify the mouth - i.e. define
the box 6 shown in figure 2. The box co-ordinates (and dimensions,
if not fixed) are transmitted to the receiver.

3) Successive frames of the training sequence are analysed to track
the mouth and thus define the current position of the box 6, and to
compare the contents of the box (the "mouth image") with the first
and any previously selected images in order to build up a set of
selected mouth images. This set of images (illustrated in fig 3) is
stored at the transmitter and transmitted to the receiver.



The transmission phase then requires:

4) Successive frames are analysed (as in (3) above) to identify the
position of the box 6;

5) The content of the box in the current frame is compared with the
stored mouth images to identify that one of the set which is nearest
to it; the corresponding codeword is then transmitted.

Assuming a frame rate of 25/second and a "codebook" of 24 mouth
shapes (i.e. a 5-bit code), the required data rate during the
transmission phase would be 125 bits/second.
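
The arithmetic behind this figure can be checked with a short
sketch. The function names are illustrative only; the calculation
assumes one fixed-length codeword per frame and no other overhead.

```python
def codeword_bits(codebook_size):
    """Smallest whole number of bits that can index every codebook entry."""
    return max(1, (codebook_size - 1).bit_length())


def data_rate(frame_rate, codebook_size):
    """Bits per second when one fixed-length codeword is sent per frame."""
    return frame_rate * codeword_bits(codebook_size)


rate = data_rate(25, 24)  # 24 shapes need 5 bits; 25 frames/s x 5 bits
```

A 5-bit code can index up to 32 shapes, so a codebook of 24 leaves
eight codewords unused.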

The receiver display obtained using the basic system described is
found to be generally satisfactory, but is somewhat unnatural
principally because (a) the head appears fixed and (b) the eyes
remain unchanged (specifically, the speaker appears never to blink).
The first of these problems may be alleviated by introducing random
head movement at the receiver, or by tracking the head position at
the transmitter and transmitting appropriate co-ordinates to the
receiver. The eyes could be transmitted using the same principles as
applied to the mouth, though here the size of the "codebook" might
be much less. Similar remarks apply to the chin and facial lines.

The implementation of the transmitter steps enumerated above will
now be considered in more detail, assuming a monochrome source image
of 128 x 128 pel resolution, of a head and shoulders picture. The
first problem is that of recognition of the facial features and
pinpointing them on the face. Other problems are determining the
orientation of the head and the changing shape of the mouth as well
as the movement of the eyes. The method proposed by Nagao (M Nagao -
"Picture Recognition and Data Structure", Graphic Languages - ed
Nake and Rosenfeld, 1972) is suggested.

Nagao's method involves producing a binary representation of the
image with an edge detector. This binary image is then analysed by
moving a window down it and summing the edge pixels in each column
of the window. The output from the window is a set of numbers in
which the large numbers represent strong vertical edges. From this
such features as the top and sides of the head, followed by the
eyes, nose and mouth can be initially recognised.

The algorithm goes on to determine the outline of the jaw and then
works back up the face to fix the positions of nose, eyes and sides
of face more accurately. A feedback process built into the algorithm
allows for repetition of parts of the search if an error is
detected. In this way the success rate is greatly improved.

A program has been written using Nagao's 
algorithm which draws fixed size rectangles around 
the features identified as eyes and mouth. Details of this program
are as follows.

A Laplacian operator is applied together with a threshold to give a
binary image of the same resolution. Edge pixels become black,
others white.

A window of dimension 128 pels x 8 lines is positioned at the top of
the binary image. The black pels in each column are summed and the
result is stored as an entry in a 128 x 32 element array (ARRAY 1).
The window is moved down the image by 4 lines each time and the
process repeated. The window is repositioned 32 times in all and the
128 x 32 element array is filled (Fig 4).
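
A minimal sketch of how ARRAY 1 might be filled is given below. It
assumes the binary image is a list of rows with 1 for a black (edge)
pel. One point is an assumption rather than something the text
states: with an 8-line window stepped by 4 lines, the 32nd position
starts at line 124 and overruns a 128-line image, so the sketch
simply clips the last window at the bottom edge.

```python
def column_edge_sums(binary, window_height=8, step=4):
    """Slide a full-width window down a binary image (1 = edge pel) and
    return one row of per-column edge counts for each window position."""
    height = len(binary)
    width = len(binary[0])
    rows = []
    for top in range(0, height, step):
        # Slicing clips the final window at the bottom edge (assumption).
        window = binary[top:top + window_height]
        rows.append([sum(line[x] for line in window) for x in range(width)])
    return rows


# A 128 x 128 test image with one strong vertical edge in column 40.
image = [[1 if x == 40 else 0 for x in range(128)] for _ in range(128)]
array1 = column_edge_sums(image)
```

Each row of the result holds 128 column sums, and a strong vertical
edge shows up as a consistently high value in one column.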

A search is conducted through the rows of ARRAY 1 starting from the
top of the image in order to locate the sides of the head. As these
are strong vertical edges they will be identified by high values in
ARRAY 1.

The first edge located from the left side of the 
image is recorded and similarly for the right side. 
The distance between these points is measured 
(head width) and if this distance exceeds a criterion 
a search is made for activity between the two 
points which may indicate the eyes.

The eyes are found using a one-dimensional 
mask, as illustrated in fig 5 which has two slots 
corresponding to the eyes separated by a gap for 
the nose. The width of the slots and their separa- 
tion is selected to be proportional to the measured 
head width. The mask is moved along a row within
the head area. The numbers in ARRAY 1 falling 
within the eye slots are summed and from this 
result, the numbers in the nose slot are subtracted. 
The final result is a sensitive indicator of activity 

due to the eyes. 

The maximum value along a row is recorded 
along with the position of the mask when this 
maximum is found. The mask is then moved down 
to the next row and the process repeated. 

Out of the set of maximum values the overall 
maximum is found. The position of this maximum 
is considered to give the vertical position of the 
eyes. Using the horizontal position of the mask 
when this maximum was found we can estimate the 
midpoint of the face. 
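
The eye search can be sketched as below. The slot and gap widths are
passed in directly (the text makes them proportional to head width
without giving the ratio), head_left and head_right are taken as
inclusive column indices, and the head is assumed to be at least as
wide as the mask. All function names are illustrative.

```python
def eye_mask_score(row, start, slot_width, gap_width):
    """Sum of the two eye slots minus the sum of the nose gap between them."""
    gap_start = start + slot_width
    right_start = gap_start + gap_width
    left = row[start:gap_start]
    gap = row[gap_start:right_start]
    right = row[right_start:right_start + slot_width]
    return sum(left) + sum(right) - sum(gap)


def best_mask_position(row, head_left, head_right, slot_width, gap_width):
    """Slide the mask along one row inside the head; return (score, start)."""
    mask_width = 2 * slot_width + gap_width
    best = None
    for start in range(head_left, head_right - mask_width + 2):
        score = eye_mask_score(row, start, slot_width, gap_width)
        if best is None or score > best[0]:
            best = (score, start)
    return best


def locate_eyes(array1, head_left, head_right, slot_width, gap_width):
    """Return (row index, mask start, face midpoint) of the overall maximum."""
    results = [best_mask_position(r, head_left, head_right, slot_width, gap_width)
               for r in array1]
    row_index = max(range(len(results)), key=lambda i: results[i][0])
    start = results[row_index][1]
    midpoint = start + slot_width + gap_width // 2
    return row_index, start, midpoint


# Three rows of a toy ARRAY 1; the middle row has "eyes" at 5-7 and 10-12.
eye_row = [0] * 20
for x in (5, 6, 7, 10, 11, 12):
    eye_row[x] = 4
demo = [[0] * 20, eye_row, [0] * 20]
eyes = locate_eyes(demo, head_left=2, head_right=17, slot_width=3, gap_width=2)
```

Subtracting the nose slot penalises positions where the activity is
continuous across the mask, which is what makes the score specific
to a pair of eyes.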

Next a fifteen pixel wide window (fig 6) is
applied to the binary image. It extends from a 
position just below the eyes to the bottom of the 
image and is centred on the middle of the face. 

The black pels in each row of the window are summed and the values
are entered into a one-dimensional array (ARRAY 2). If this array is
displayed as a histogram, such features as the bottom of the nose,
the mouth and the shadow under the lower lip show up clearly as
peaks (Figure 7). The distribution of these peaks is used to fix the
position of the mouth.

The box position is determined centred on the centre of the face as
defined above, and on the centre of the mouth (row 35 in fig 7); for
the given resolution, a suitable box size might be 40 pels wide by
24 high.

The next stage is to ensure that the identification of the mouth
(box position) in the first frame and during the learning (and
transmission) phase is consistent - so that the mouth is always
centred within the box. Application of Nagao's algorithm to each
frame of a sequence in turn is found to show a considerable error in
registration of the mouth box from frame to frame.

A solution to this problem was found by applying the algorithm to
the first frame only and then tracking the mouth frame by frame.
This is achieved by using the mouth in the first frame of the binary
sequence as a template and auto-correlating with each of the
successive frames in the binary image referred to above. The search
is started in the same relative position in the next frame and the
mask moved by 1 pixel at a time until a local maximum is found.
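
One way to read this tracking step is as a hill climb on a binary
correlation score, sketched below. The score used here (the count of
pels that are black in both template and frame) and the choice of
four neighbouring positions per step are assumptions; the text only
says that the mask is moved one pixel at a time to a local maximum.

```python
def match_score(frame, template, x, y):
    """Number of pels where template and frame are both 1."""
    return sum(t & frame[y + dy][x + dx]
               for dy, row in enumerate(template)
               for dx, t in enumerate(row))


def track(frame, template, x, y):
    """Hill-climb from (x, y), one pel per step, to a local score maximum."""
    h, w = len(template), len(template[0])
    max_x, max_y = len(frame[0]) - w, len(frame) - h
    best = match_score(frame, template, x, y)
    while True:
        scored = [(match_score(frame, template, x + dx, y + dy), x + dx, y + dy)
                  for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= x + dx <= max_x and 0 <= y + dy <= max_y]
        top = max(scored, default=(best, x, y))
        if top[0] <= best:
            return x, y          # no neighbour improves: local maximum
        best, x, y = top


# 8 x 8 binary frame containing a 3 x 3 block whose corner is at (4, 3).
frame = [[1 if 4 <= x <= 6 and 3 <= y <= 5 else 0 for x in range(8)]
         for y in range(8)]
template = [[1, 1, 1] for _ in range(3)]
position = track(frame, template, 3, 3)  # start from last frame's position
```

Starting from the previous frame's position keeps the search short,
since the mouth rarely moves far between consecutive frames.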

The method was used to obtain a sequence using the correct mouth but
copying the rest of the face from the first frame. This processed
sequence was run and showed some registration jitter, but this error
was only about one pixel, which is the best that can be achieved
without sub-pixel interpolation.

Typical binary images of the mouth area (mouth open and mouth
closed) are shown in figures 8 and 9.

Only a small set of mouths from the total possible in the whole
sequence can be stored in the look-up table, for obvious reasons.
This requires the shape of a mouth to be recognised, and a decision
to be made as to whether or not it is similar to a shape which has
occurred previously. New mouth positions would then be stored in the
table.

The similarity or difference of a mouth to previously occurring
mouths thus needs to be based on a quantisation process in order to
restrict the number of entries in the table.

The method by which this is achieved is as follows, all processing
being carried out on greyscale mouth images rather than the binary
version referred to above.

The mouth image from the first frame is stored as the first -
initially the only - entry in a look-up table. The mouth image from
each frame in the training sequence is then processed by (a)
comparing it with each entry in the table by subtracting the
individual pel values and summing the absolute values of those
differences over the mouth box area; (b) comparing the sum with a
threshold value and, if the threshold is exceeded, entering that
mouth image as a new entry in the table.
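
A minimal sketch of this table-building step follows. It reads "the
sum" in (b) as the smallest sum over all current entries, so that an
image is added only when no existing entry is close to it; that
reading is an assumption. Function names are illustrative.

```python
def sad(a, b):
    """Sum of absolute pel differences between two equal-sized images."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def build_codebook(mouth_images, threshold):
    """First image seeds the table; later ones join only if no entry is close."""
    table = [mouth_images[0]]
    for image in mouth_images[1:]:
        if min(sad(image, entry) for entry in table) > threshold:
            table.append(image)
    return table


def encode(image, table):
    """Index of the table entry with the lowest sum of absolute differences."""
    return min(range(len(table)), key=lambda i: sad(image, table[i]))


# Four toy 2 x 2 greyscale "mouths"; the second and fourth are near-duplicates.
training = [[[0, 0], [0, 0]], [[1, 0], [0, 0]],
            [[9, 9], [9, 9]], [[8, 9], [9, 9]]]
codebook = build_codebook(training, threshold=4)
codeword = encode(training[3], codebook)
```

Raising the threshold merges more shapes into each entry and so
shrinks the table; lowering it has the opposite effect.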

However, this particular method of finding the 
sum of the absolute differences is very susceptible 
to movement. For example, two identical images where the second one
has been shifted by just one pixel to the left would produce a large
value for the sum, whereas these two images should be seen as
identical. If a small degree of movement within the overall tracking
is permitted, to try to compensate for the fact that the match
deteriorates dramatically if the image is displaced by only one
pixel, then a reduction in the size of the look-up table can be
achieved without a corresponding loss of mouth shapes. This can be
done if, at each frame, the mouth in the current frame is compared
three times with each of the code-book entries - at the current
position, shifted to the left by one pixel, and shifted to the right
by one pixel - and the minimum sum found in each case. The result
generating the smallest minimum sum together with the value of the
shift in the x-direction is recorded. This movement could, of
course, be performed in both the x- and the y-directions but it has
been found that the majority of movement is in the x-direction.
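
The three-position comparison can be sketched as follows, under the
assumption that the shifted comparison is made by re-cropping the
mouth box from the frame one pel to the left or right. The names
crop and best_match_with_shift are hypothetical.

```python
def sad(a, b):
    """Sum of absolute pel differences between two equal-sized images."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def crop(frame, x, y, w, h):
    """Cut a w x h box out of the frame with its top-left corner at (x, y)."""
    return [row[x:x + w] for row in frame[y:y + h]]


def best_match_with_shift(frame, box_x, box_y, table, shifts=(-1, 0, 1)):
    """Return (sum, entry index, shift) minimising the sum over all
    code-book entries and all permitted x-shifts of the mouth box."""
    h, w = len(table[0]), len(table[0][0])
    best = None
    for shift in shifts:
        x = box_x + shift
        if x < 0 or x + w > len(frame[0]):
            continue  # shifted box would fall outside the frame
        candidate = crop(frame, x, box_y, w, h)
        for index, entry in enumerate(table):
            score = sad(candidate, entry)
            if best is None or score < best[0]:
                best = (score, index, shift)
    return best


# The tracked box (x = 1) is one pel to the left of the true mouth (x = 2).
frame = [[3, 3, 5, 7, 3, 3], [3, 3, 5, 7, 3, 3]]
table = [[[5, 7], [5, 7]], [[9, 9], [9, 9]]]
result = best_match_with_shift(frame, box_x=1, box_y=0, table=table)
```

In this toy case the unshifted comparison gives a sum of 8 against
entry 0, while the one-pel right shift recovers an exact match.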

If the desired table size is exceeded, or the number of entries
acquired during the training sequence is substantially lower than
the table size, then the threshold level is adjusted appropriately
and the learning phase repeated; to avoid excessive delay such
conditions might be predicted from the acquisition rate.

Once the table has been constructed, the transmission phase can
commence, in which each successive mouth image is compared - as
described in (a) above - with all those of the stored table and a
codeword identifying the entry which gave the lowest summation
result is then transmitted.

The computation required to do this is large but can be decreased if
an alternative searching method is adopted. The simplest alternative
would be, instead of looking at all the mouths in the look-up table
and finding the minimum sum, to use the first one that has a sum
which is less than the threshold. On its own, this would certainly
be quicker, but would be likely to suffer from a large amount of
jerkiness if the order in which the table is scanned were fixed.
Therefore the order in which the table is scanned must be varied. A
preferred variation requires a record of the order in which mouths
from the code-book appear - a sort of rank-ordering - to be kept.
For example, if the previous frame used mouth 0 from the table, then
one scans the table for the current frame starting with the entry
which has appeared most often after mouth 0 in the past, say
mouth 5. If the sum of the absolute differences between the current
frame and mouth 5 is less than the threshold then mouth 5 is chosen
to represent the current frame. If it is greater than the threshold,
one moves along to the next mouth in the code-book which has
appeared after mouth 0 the second most often, and so on. When a
mouth is finally chosen, the record of which mouth is chosen is
updated to include the current information.
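
The rank-ordered scan can be sketched as below. The record is kept
here as a nested dictionary of follower counts, which is one
possible representation and not one prescribed by the text. The text
also does not say what happens when no entry falls below the
threshold; falling back to the closest entry scanned is an
assumption of this sketch.

```python
def sad(a, b):
    """Sum of absolute pel differences between two equal-sized images."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def scan_order(previous, counts, table_size):
    """Entries sorted by how often each has followed `previous`, most first."""
    followers = counts.get(previous, {})
    return sorted(range(table_size), key=lambda i: -followers.get(i, 0))


def choose_mouth(image, table, previous, counts, threshold):
    """Take the first entry under the threshold, scanning in rank order."""
    best_index, best_sum = None, None
    for index in scan_order(previous, counts, len(table)):
        score = sad(image, table[index])
        if score < threshold:
            best_index = index
            break
        if best_sum is None or score < best_sum:
            best_index, best_sum = index, score  # fallback: closest so far
    # Update the record of which mouth followed which.
    row = counts.setdefault(previous, {})
    row[best_index] = row.get(best_index, 0) + 1
    return best_index


# Toy 1 x 1 "mouths"; after mouth 0, mouth 2 has appeared most often.
table = [[[0]], [[10]], [[11]]]
counts = {0: {2: 3, 1: 1}}
order = scan_order(0, counts, len(table))
chosen = choose_mouth([[10]], table, previous=0, counts=counts, threshold=2)
```

Here mouth 2 is accepted even though mouth 1 is an exact match,
because it is scanned first and is already within the threshold;
this is the speed-for-accuracy trade the text describes.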

Optionally, mouth images having a lowest summation result above a
set value might be recognised as being shapes not present in the set
and initiate a dynamic update process in which an additional mouth
image is appended to the table and sent to the receiver during the
transmission phase. In most circumstances, transmission of the "new"
mouth would not be fast enough to permit its use for the frame
giving rise to it, but it would be available for future occurrences
of that shape.

Care must be taken in this case that the set value is not too low
because this can result in new mouths being placed into the look-up
table all the way through the sequence. Again this is no more than
image sub-sampling, which would obviously produce a reasonable
result but which would need a code-book whose size is proportional
to the length of the sequence being processed.

The set value can be arrived at through trial and error. It would
obviously be desirable if this threshold could be selected
automatically, or dispensed with altogether. The sum of the absolute
differences between frames is always a positive measure, and the
look-up table therefore represents a metric space. Each mouth in the
look-up table can be thought of as existing in a multi-dimensional
metric space, and each frame in a sequence lies in a cluster around
one of these codebook mouths. Various algorithms such as the
Linde-Buzo-Gray exist which could be used to find the optimum set of
mouths. These algorithms use the set of frames in the sequence as a
training set and involve lengthy searches to minimise the error and
find the optimum set. Preferable to this is to find a
"representative" set of mouths which are sub-optimal, but which can
be found more quickly than the optimum set. In order to do this it
is necessary to specify the number of mouths that are to be used,
and then to select the required number of mouths from the training
sequence. The look-up table can still be updated during the
transmission phase using the same algorithm as for training, but the
total number of mouths in the table will remain constant.

The selection of mouths follows a basic rule - if the minimum
distance (distance can be used since it is a metric space) between
the current frame and one of the mouths in the table is greater than
the minimum distance between that mouth in the table and any other
mouth in the table then the current mouth should be included in the
table. If it is less, then that mouth is simply represented by the
nearest mouth in the table. When a new mouth is to be included in
the table during a transmission phase then the mouth that has to be
removed is selected according to the following rule - find the pair
of mouths in the look-up table that are closest together and throw
one of them away, preferably the one that is nearest to the new
mouth.
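
The selection and replacement rules can be sketched together as
follows. The sketch assumes a table of at least two entries and
overwrites the discarded entry in place so that the table size stays
constant; reusing the discarded entry's codeword in this way is an
assumption. The function name update_table is illustrative.

```python
def sad(a, b):
    """Sum of absolute pel differences, used here as the distance measure."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def update_table(image, table):
    """Apply the selection rule and return the index that represents `image`.
    The table is modified in place when the new mouth is included."""
    dist = [sad(image, m) for m in table]
    nearest = min(range(len(table)), key=dist.__getitem__)
    # Distance from the nearest entry to its own closest neighbour.
    spacing = min(sad(table[nearest], m)
                  for i, m in enumerate(table) if i != nearest)
    if dist[nearest] <= spacing:
        return nearest  # represented by the existing nearest entry
    # Find the closest pair in the table and drop the member nearer to
    # the new mouth, putting the new mouth in its place.
    pairs = [(sad(table[i], table[j]), i, j)
             for i in range(len(table)) for j in range(i + 1, len(table))]
    _, i, j = min(pairs)
    victim = i if dist[i] <= dist[j] else j
    table[victim] = image
    return victim


# Toy 1 x 1 "mouths": entries 0 and 1 are crowded together, 2 is far away.
table = [[[0]], [[2]], [[20]]]
first = update_table([[10]], table)   # far from everything: replaces entry 1
second = update_table([[11]], table)  # now close to the new entry 1
```

After the first call the table entries are spread more evenly (0,
10, 20), which is the effect the rule is designed to achieve.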

When a new mouth is entered in the table, then clearly it has no
past history with which to order the other mouths in the code-book -
none of them will have appeared after this new mouth. When the next
frame in the sequence is encountered, the look-up table would be
scanned in order, arriving at the new entry last. However, this new
entry is the most likely choice, since mouths tend to appear in
clumps, particularly just after a new mouth has been created. So the
ordering is adjusted so that the new mouth is scanned first.

The above-described transmission system might be employed in a
picture-phone system employing a standard telephone link; to allow
for the learning phase, the image would not immediately appear at
the receiver. Following the initial delay - perhaps 15 seconds
assuming non-digital transmission of the face - the moving picture
would be transmitted and displayed in real time.

A fixed mouth overlay can be used on a face orientated differently
from the forward facing position if the difference is not too large.
Also, it is clear that in order to show general movements of the
head such as nodding and shaking one must display the face as seen
from a number of different angles. A displayed head is unconvincing
unless there is some general movement, if only random movement.

In a system such as the one described, dif- 
ferent views of the face would have to be transmit- 
ted and stored at the receiver. If a complete set of 
data were sent for every different face position this 
would require excessive channel and storage 
capacities. A possible way around the problem is 
shown in Fig 10. 

The appearance of the face in the frontal position is represented by
the projection (x1-x5) in plane P. If the head is turned slightly to
one side its appearance to the observer will now be represented by
(x1'-x5') in plane P'. If the illumination of the face is fairly
isotropic then a two dimensional transformation of (x1-x5) should be
a close approximation to (x1'-x5').

The important differences would occur at the sides of the head where
new areas are revealed or occluded and, similarly, at the nose. Thus
by transmitting a code giving the change in orientation of the head
as well as a small set of differences, the whole head could be
reconstructed. The differences for each head position could be
stored and used in the future if the same position is identified.

The concept of producing pseudo-rotations by 2-D transformation is
illustrated with reference to the "face" picture of Figure 11.

To simulate the effect of vertical axis rotation in a direction such
that the nose moves by a displacement S from left to right (as
viewed):

(1) Points to the left of the line (X1-X1') do not move.

(2) Points on the line (X2-X2') move to the right with displacement
S/2. (Region (X1,X1',X2,X2') is stretched accordingly).

(3) Points on the line (X3-X3') move to the right with displacement
S. (Region (X2,X2',X3,X3') is stretched).

(4) Points on the line (X4-X4') move to the right with displacement
S. (Region (X3,X3',X4,X4') is translated to the right).

(5) Points on the line (X5-X5') move to the right with displacement
S/2. (Region (X4,X4',X5,X5') is shrunk).

(6) Points to the right of the line (X6-X6') do not move. (Region
(X5,X5',X6,X6') is shrunk).
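
The six rules above amount to a piecewise displacement profile
across the face, which can be sketched as follows. Two
simplifications are assumed: each reference line is treated as
vertical, so it is given by a single x position, and the stretching
or shrinking inside a region is taken to be linear. Neither is
stated in the text.

```python
def displacement(x, lines, s):
    """Horizontal shift for a point at column x.

    `lines` holds the x positions of the reference lines X1..X6 in
    strictly ascending order; `s` is the displacement S of the nose.
    """
    shifts = [0.0, s / 2, s, s, s / 2, 0.0]  # shift on X1, X2, ..., X6
    if x <= lines[0] or x >= lines[-1]:
        return 0.0  # rules (1) and (6): outside X1..X6 nothing moves
    for k in range(5):
        x0, x1 = lines[k], lines[k + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return shifts[k] + t * (shifts[k + 1] - shifts[k])


lines = [10, 20, 30, 40, 50, 60]
profile = [displacement(x, lines, 4) for x in (5, 20, 25, 35, 45, 55, 70)]
```

Between X3 and X4 every point moves by the full S, so that region is
translated unchanged, while the regions on either side absorb the
stretch and the shrink.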

Two-dimensional graphical transformations could be used in a system
for a standard videoconferencing application. In this system, human
subjects would be recognised and isolated from non-moving foreground
and background objects. Foreground and background would be stored in
memory at different hierarchical levels according to whether they
were capable of occluding moving objects. Relatively unchanging
moving bodies such as torsos would be stored on another level as
would more rapidly changing parts such as the arms and head.

The principle of operation of the system would require the
transmission end to identify movement of the various segmented parts
and send motion vectors accordingly. These would be used by the
receiver to form a prediction for each part in the next frame. The
differences between the prediction and the true picture would be
sent as in a standard motion compensation system.

The system should achieve high data compression without significant
picture degradation for a number of reasons:

1) If an object is occluded and then revealed once more the data
does not have to be retransmitted.

2) For relatively unchanging bodies such as torsos a very good
prediction could be formed using minor graphical transformations
such as translations and rotations in the image plane and changes of
scale. The differences between the prediction and the true picture
should be small.

3) For the more rapidly moving objects a good prediction should
still be possible although the differences would be greater.

4) It could treat subjectively important features in the scene
differently from the less important features. For instance, faces
could be weighted more heavily than rapidly moving arms.

A second embodiment of the invention relates to the synthesis of a
moving picture of a speaker to accompany synthesised speech. Two
types of speech synthesis will be considered:

(a) Limited vocabulary synthesis in which digitised representations
of complete words are stored and the words are retrieved under
control of manual, computer or other input and regenerated. The
manner of storage, whether PCM or as formant parameters for example,
does not affect the picture synthesis.

(b) Allophone synthesis in which any word can 
be synthesised by supplying codes representing 
sounds to be uttered; these codes may be gen- 
erated directly from input text (text to speech 
systems). 

In either case there are two stages to the face 
synthesis; a learning phase corresponding to that 
described above, and a synthesis phase in which 
the appropriate mouth codewords are generated to 
accompany the synthesised speech. 

Considering option (a) first, the speech vocabulary will usually be
generated by recording the utterances of a native speaker and it
will often be convenient to use the face of the same speaker. If
another face is desired, or to add a vision facility to an existing
system, the substitute speaker can speak along with a replay of the
speech vocabulary. Either way the procedure is the same. The
learning phase is the same as that described above, in that the
system acquires the required face frame and mouth look-up table.
However it must also record the sequence of mouth position codewords
corresponding to each word and store this sequence in a further
table (the mouth code table). It is observed that this procedure
does not need to be carried out in real time and hence offers the
opportunity of optimising the mouth sequences for each word.

In the synthesis phase input codes supplied to 
the synthesiser are used not only to retrieve the 
speech data and pass it to a speech regeneration 
unit or synthesiser but also to retrieve the mouth 
codewords and transmit these in synchronism with 
the speech to a receiver which reconstructs the 
moving pictures as described above with reference 
to figure 1. Alternatively the receiver functions 
could be carried out locally, for local display or for 
onward transmission of a standard video signal. 

In the case of (b) allophone synthesis, a real face is again
required and the previously described learning phase is carried out
to generate the face image and mouth image table. Here however it is
necessary to correlate mouth positions with individual phonemes (ie
parts of words) and thus the owner of the face needs to utter,
simultaneously with its generation by the speech synthesiser, a
representative passage including at least one example of each
allophone which the speech synthesiser is capable of producing. The
codewords generated are then entered into a mouth look-up table in
which each entry corresponds to one allophone. Most entries will
consist of more than one codeword. In some cases the mouth positions
corresponding to a given phoneme may vary in dependence on the
preceding or following phonemes and this may also be taken into
account. Retrieval of the speech and video data takes place in
similar manner to that described above for the "whole word"
synthesis.

Note that in the "synthetic speech" embodiment the face frame, mouth
image table and mouth position code words may, as in the
transmission system described above, be transmitted to a remote
receiver for regeneration of a moving picture, but in some
circumstances, eg a visual display to accompany a synthetic speech
computer output, the display may be local and hence the "receiver"
processing may be carried out on the same apparatus as the table and
codeword generation. Alternatively, the synthesised picture may be
generated locally and a conventional video signal transmitted to a
remote monitor.

The question of synchronisation will now be considered further.

A typical text-to-speech synthesis comprises the steps of:

(a) Conversion of plain text input to phonetic representation.

(b) Conversion of phonetic to lower phonetic representation.

(c) Conversion of lower phonetic to formant parameters; a typical
parameter update period would be 10 ms.

This amount of processing involves a degree of delay; moreover, some
conversion stages have an inherent delay since the conversion is
context dependent (e.g. where the sound of a particular character is
influenced by those which follow it). Hence the synthesis process
involves queueing and timing needs to be carefully considered to
ensure that the synthesised lip movements are synchronised with the
speech.

Where (as mooted above) the visual synthesis uses the allophone
representation as input data from the speech synthesiser, and if the
speech synthesis process from that level downward involves
predictable delays, then proper timing may be ensured simply by
introducing corresponding delays in the visual synthesis.
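As an illustration of introducing a corresponding delay in the visual path, the sketch below holds each mouth codeword back by a fixed number of frame periods. It assumes the speech path delay is known and constant; the function name `delay_codewords` and the `idle` placeholder are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional


def delay_codewords(codewords: Iterable[int], delay_frames: int,
                    idle: Optional[int] = None) -> Iterator[Optional[int]]:
    """Yield each codeword delay_frames ticks after it was supplied.

    idle is emitted while the delay line is still filling.
    """
    line: Deque[Optional[int]] = deque([idle] * delay_frames)
    for codeword in codewords:
        line.append(codeword)
        yield line.popleft()
    # Flush the codewords still held in the delay line.
    while line:
        yield line.popleft()
```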

An alternative proposal is to insert flags in the speech
representations. This could permit the option of programming mouth
positions into the source text instead of (or in addition to) using a
lookup table to generate the mouth positions from the allophones.
Either way, flags indicating the precise instants at which mouth
positions change could be maintained in the speech representations
down to (say) the lower phonetic level. The speech synthesiser creates
a queue of lower phonetic codes which are then converted to formant
parameters and passed to the formant synthesiser hardware; as the
codes are "pulled off" the queue, each flag, once the text preceding
it has been spoken, is passed to the visual synthesiser to synchronise
the corresponding mouth position change.
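The flag mechanism can be sketched as a single queue holding both lower phonetic codes and flags. In this hypothetical example `speak` stands for the (blocking) formant synthesis of one code and `show_mouth` for the visual synthesiser; the names and the tuple encoding of a flag are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, Tuple, Union

Phone = str             # a lower phonetic code
Flag = Tuple[str, int]  # ("FLAG", mouth codeword); encoding is an assumption
Item = Union[Phone, Flag]


def run_queue(queue: Deque[Item],
              speak: Callable[[Phone], None],
              show_mouth: Callable[[int], None]) -> None:
    """Pull items off the queue in order.

    A flag is forwarded to the visual synthesiser only after every code
    ahead of it in the queue has been spoken.
    """
    while queue:
        item = queue.popleft()
        if isinstance(item, tuple):
            show_mouth(item[1])
        else:
            speak(item)  # assumed to return once the sound has been produced
```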

A third embodiment of the invention concerns 
the generation of a moving face to accompany real 
speech input. 

Again, a surrogate speaker is needed to provide the face, and the
learning phase for generation of the mouth image table takes place as
before. The generation of the mouth code table depends on the means
used to analyse the input speech; however, one option is to employ
spectrum analysis to generate sequences of spectral parameters (a well
known technique), with the code table serving to correlate those
parameters and mouth images.

Apparatus for such speech analysis is shown in Figure 12. Each vowel
phoneme has a distinct visual appearance. The visual correlate of the
auditory phoneme is called a viseme [K W Berger - Speechreading:
Principles and Methods. Baltimore: National Educational Press, 1972,
pp. 73-107]. However, many of the consonants have the same visual
appearance and the most common classification of consonant visemes has
only 12 categories. This means that there will be no visible error if
the system confuses phonemes belonging to the same category. Since
there is less acoustic energy generated in consonant formation than
vowel formation it would be more difficult for a speech recogniser to
distinguish between consonants. Thus the many to one mapping of
consonant phonemes to consonant visemes is fortuitous for this system.
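The benefit of the many to one mapping can be shown with a small example. The grouping below is a hypothetical fragment chosen for illustration (consonants articulated with the same lip position share a class); it is not the 12-category classification referred to in the text.

```python
# Hypothetical fragment of a phoneme-to-viseme mapping (illustrative only).
VISEME_OF = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
}


def visible_error(spoken: str, recognised: str) -> bool:
    """A recognition error is visible only if the viseme class changes."""
    return VISEME_OF[spoken] != VISEME_OF[recognised]
```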

A method of analysing speech would use a filter bank 10 with 14-15
channels covering the entire speech range. The acoustic energy in each
channel is integrated using a leaky integrator 11 and the output
sampled 12 at the video frame rate (every 40 ms). A subject is
required to pronounce during a training sequence a full set of phoneme
sounds and the filter bank analyses the speech. Individual speech
sounds are identified by thresholding the energy over each set of
samples. The sample values are stored in a set of memory locations 13
which are labelled with the appropriate phoneme name. These form a set
of templates which subsequently are used to identify phonemes in an
unknown speech signal from the same subject. This is done by using the
filter bank to analyse the unknown speech at the same sampling rate.
The unknown speech sample is compared with each of the templates by
summing the squares of the differences of the corresponding
components. The best match is given by the smallest difference. Thus
the device outputs a code corresponding to the best phoneme match.
There would also be a special code to indicate silence.
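The template comparison described above amounts to a nearest-template search under a sum of squared differences. The sketch below assumes each sample is a list of per-channel energies; the silence test (total energy at or below a threshold) and all names are assumptions, since the text does not say how the silence code is selected.

```python
from typing import Dict, Sequence

SILENCE = "sil"  # hypothetical label for the special silence code


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences of corresponding channel energies."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(sample: Sequence[float],
               templates: Dict[str, Sequence[float]],
               silence_threshold: float = 0.05) -> str:
    """Return the phoneme whose template is closest to the sample.

    The silence rule is an assumption made for this sketch.
    """
    if sum(sample) <= silence_threshold:
        return SILENCE
    return min(templates, key=lambda name: squared_distance(sample, templates[name]))
```

For example, with templates for two phonemes over three channels, a sample close to the first template is labelled with the first phoneme, and an all-zero sample is labelled as silence.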

While the subject utters the set of phonemes during the training
sequence a moving sequence of pictures of the mouth area is captured.
By pinpointing the occurrence of each phoneme the corresponding frame
in the sequence is located and a subset of these frames is used to
construct a code-book of mouths. In operation a look-up table is used
to find the appropriate mouth code from the code produced by the
speech analyser. The code denoting silence should invoke a fully
closed mouth position. A synthetic sequence is created by overlaying
the appropriate mouth over the face at video rate.
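The look-up and overlay step might be sketched as follows, with images held as lists of pixel rows. The table contents, the codeword numbers and the function names are hypothetical; only the mapping of the silence code to a closed mouth is taken from the text.

```python
from typing import Dict, List

Image = List[List[int]]

CLOSED_MOUTH = 0  # hypothetical codeword of the fully closed mouth

# Hypothetical look-up table from speech analyser code to mouth codeword.
MOUTH_CODE: Dict[str, int] = {"sil": CLOSED_MOUTH, "m": CLOSED_MOUTH, "aa": 1}


def overlay(face: Image, mouth: Image, top: int, left: int) -> Image:
    """Return a copy of face with mouth pasted at (top, left)."""
    out = [row[:] for row in face]
    for r, mouth_row in enumerate(mouth):
        out[top + r][left:left + len(mouth_row)] = mouth_row
    return out


def synthesise_frame(face: Image, codebook: Dict[int, Image],
                     analyser_code: str, top: int, left: int) -> Image:
    """Pick the mouth for one analyser code and overlay it on the face."""
    return overlay(face, codebook[MOUTH_CODE[analyser_code]], top, left)
```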

As with the case of synthesised speech, the "receiver" processing may
be local or remote. In the latter case, it is proposed, as an
additional modification, that the mouth image table stored at the
transmitter might contain a larger number of entries than is normally
sent to the receiver. This would enable the table to include mouth
shapes which, in general, occur only rarely, but may occur frequently
in certain types of speech: for example, shapes which correspond to
sounds which occur only in certain regional accents. Recognition of
the spectral parameters corresponding to such a sound would then
initiate the dynamic update process referred to earlier to make the
relevant mouth shape(s) available at the receiver.

The construction of appropriate display (receiver) arrangements for
the above proposals will now be further considered (see Figure 13).

A frame store 100 is provided, into which during the learning phase
the received still frame is entered from an input decoder 101, whilst
a "mouth" store 102 stores the desired number (say 25) of mouth
positions. Readout logic 103 repeatedly reads the contents of the
frame store and adds synchronising pulses to feed a video monitor 104.
In the transmission phase, received codewords are supplied to a
control unit 105 which controls overwriting of the relevant area of
the frame store 100 with the corresponding mouth store entry. Clearly
this overwriting needs to be rapid so as not to be visible to the
viewer. These effects could be reduced by dividing the update area
into small blocks and overwriting in a random or predefined
non-sequential manner. Alternatively, if the frame store architecture
includes windows then these could be preloaded with the picture
updates and switched in and out to create the appropriate movement. In
some cases it may be possible to simplify the process by employing
x-y shifting of the window origin.
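Overwriting the mouth area in small blocks taken in a non-sequential order could look like the sketch below. The frame store is modelled as a list of pixel rows; the block size, the use of a seeded shuffle and the function names are assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple

Image = List[List[int]]


def block_order(rows: int, cols: int, block: int,
                seed: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return the block origins of a rows x cols area in shuffled order."""
    origins = [(r, c) for r in range(0, rows, block)
               for c in range(0, cols, block)]
    random.Random(seed).shuffle(origins)
    return origins


def overwrite_in_blocks(frame: Image, mouth: Image, top: int, left: int,
                        block: int = 2, seed: Optional[int] = 0) -> None:
    """Copy mouth into frame at (top, left), one small block at a time."""
    rows, cols = len(mouth), len(mouth[0])
    for r0, c0 in block_order(rows, cols, block, seed):
        for r in range(r0, min(r0 + block, rows)):
            for c in range(c0, min(c0 + block, cols)):
                frame[top + r][left + c] = mouth[r][c]
```

A predefined (rather than random) order would simply replace the shuffle with a fixed permutation of the block origins.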

Claims 

1. An apparatus for encoding a moving image including a human face
(5) comprising:

means (1) for receiving video input data;

means for output of data enabling one frame of the image to be
reproduced;

identification means arranged in operation for each frame of the
image to identify that part of the input data corresponding to the
mouth (6) of the face represented and

(a) in a first phase of operation to compare the mouth data parts of
each frame with those of other frames to select a representative set
(Fig. 3) of mouth data parts, to store the representative set and to
output this set;

(b) in a second phase to compare the mouth data part of each frame
with those of the stored set and to generate a codeword to be output
indicating which member of the set the mouth data part of that frame
most closely resembles.

2. An apparatus according to claim 1 in which the identification
means is arranged in operation firstly to identify that part of one
frame of input data corresponding to the mouth of the face represented
and to identify the mouth part of successive frames by
auto-correlation with data from the said one frame.

3. An apparatus according to claim 1 or 2 arranged in operation
during the first phase to store a first mouth data part and then for
the mouth data parts of each successive frame to compare it with the
first and any other stored mouth data part and, if the result of the
comparison exceeds a threshold value, to store and output it.

4. An apparatus according to claim 1, 2 or 3 in which the comparison
of mouth data is carried out by subtraction of individual picture
element values and summing the absolute values of the differences.

5. An apparatus according to claim 1, 2, 3 or 4 including means for
obtaining the coordinates of the position of the face within
successive frames of the image and generating coded data representing
those coordinates.

6. An apparatus according to any one of the preceding claims, in
which during the second phase in the event that the result of the
comparison between a mouth data part and that one of the set which it
most closely resembles exceeds a predetermined threshold, that data
part is output and stored as part of the set.

7. An apparatus according to any one of the preceding claims further
including identification means arranged in operation for each frame of
the image to identify that part of the input data corresponding to the
eyes of the face represented and

(a) in the first phase of operation to compare the eye data parts of
each frame with those of other frames to select a representative set
of eye data parts, to store this representative set and to output the
said set;

(b) in the second phase to compare the eye data part of each frame
with those of the stored set and to generate a codeword indicating
which member of the set the eye data part of that frame most closely
resembles.

8. A speech synthesiser including means for synthesis of a moving
image including a human face, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
(Fig. 3) each corresponding to the mouth area of the face and
representing a respective different mouth shape;

(c) an input for receiving codes identifying words or parts of words
to be spoken;

(d) speech synthesis means responsive to the codes received at the
said input to synthesise words or parts of words corresponding
thereto;

(e) means storing a table relating such codes to codewords
identifying said mouth data blocks or sequences of such codewords; and

(f) control means responsive to the codes received at the said input
to select the corresponding codeword or codeword sequence from the
table and to output it in synchronism with synthesis of the
corresponding word or part of a word by the speech synthesis means.

9. A synthesiser according to claim 8 in which the speech synthesis
means includes means arranged in operation for processing and queueing
the input codes, the queue including flag codes indicating changes in
mouth shape, and responsive to each flag code to transmit to the
control means, after the speech synthesiser has generated the speech
represented by the input code preceding that flag code in the queue,
an indication whereby the control means may synchronise the codeword
output to the synthesised speech.

10. An apparatus for synthesis of a moving image, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks each
corresponding to the mouth area of the face and representing a
respective different mouth shape;

(c) an audio input for receiving speech signals and frequency
analysis means (10, 11, 12) responsive to such signals to produce
sequences of spectral parameters;

(d) means (13) storing a table relating spectral parameter sequences
to codewords identifying mouth data blocks or sequences thereof;

(e) control means responsive to the said spectral parameters to
select for output the corresponding codewords or codeword sequences
from the table.

11. An apparatus according to claim 8, 9 or 10 further including
frame store means (100) for receiving and storing data representing
one frame of the image;

means (103) for repetitive readout of the frame store to produce a
video signal; and

control means (105) arranged in operation to receive the selected
codewords and in response to each codeword to read out the
corresponding mouth data block and to effect insertion of that data
into the data supplied to the readout means (103).

Claims (translated from the French)

1. An apparatus for encoding a moving image including a human face
(5), comprising:

means (1) for receiving video input data;

means for output of data enabling one frame of the image to be
reproduced;

identification means arranged in operation, for each frame of the
image, to identify the part of the input data corresponding to the
mouth (6) of the face represented and

(a) to compare, in a first phase of operation, the mouth data parts
of each frame with those of other frames in order to select a
representative set (Fig. 3) of mouth data parts, to store the
representative set and to output this set;

(b) to compare, in a second phase, the mouth data part of each frame
with those of the stored set and to generate a codeword to be output
indicating which member of the set the mouth data part of that frame
most closely resembles.

2. An apparatus according to claim 1, in which the identification
means is arranged in operation firstly to identify the part of a first
frame of input data corresponding to the mouth of the face represented
and to identify the mouth part of successive frames by
auto-correlation with the data from the said first frame.

3. An apparatus according to claim 1 or 2, arranged in operation to
store, during the first phase, a first mouth data part and then to
compare each of the mouth data parts of each successive frame with the
first and with any other stored mouth data part and to store and
output it if the result exceeds a threshold value.

4. An apparatus according to claim 1, 2 or 3, in which the comparison
of mouth data is carried out by subtraction of individual picture
element values and addition of the absolute values of the differences.

5. An apparatus according to claim 1, 2, 3 or 4, including means for
obtaining the coordinates of the position of the face within
successive frames of the image and generating coded data representing
those coordinates.

6. An apparatus according to any one of the preceding claims, in
which, during the second phase, in the event that the result of the
comparison between a mouth data part and the member of the set which
it most closely resembles exceeds a predetermined threshold, that data
part is output and stored as a member of the set.

7. An apparatus according to any one of the preceding claims, further
including identification means arranged in operation, for each frame
of the image, to identify the part of the input data corresponding to
the eyes of the face represented and

(a) to compare, in a first phase of operation, the eye data parts of
each frame with those of other frames in order to select a
representative set of eye data parts, to store this representative set
and to output the said set;

(b) to compare, in a second phase, the eye data part of each frame
with those of the stored set and to generate a codeword indicating
which member of the set the eye data part of that frame most closely
resembles.

8. A speech synthesiser including means for synthesis of a moving
image including a human face, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
(Fig. 3) each corresponding to the mouth area of the face and
representing a respective different mouth shape;

(c) an input for receiving codes identifying words or parts of words
to be spoken;

(d) speech synthesis means responsive to the codes received at the
said input to synthesise the words or parts of words corresponding to
them;

(e) means storing a table relating these codes to codewords
identifying the said mouth data blocks or sequences of the said
codewords; and

(f) control means responsive to the codes received at the said input
to select the corresponding codeword or codeword sequence from the
table and to output it in synchronism with the synthesis of the
corresponding word or part of a word by the speech synthesis means.

9. A synthesiser according to claim 8, in which the speech synthesis
means includes means arranged in operation to process and queue the
input codes, the queue including flag codes indicating changes in
mouth shape, and responsive to each flag code to transmit to the
control means, after the speech synthesiser has generated the speech
represented by the input code preceding that flag code in the queue,
an indication whereby the control means may synchronise the codeword
output with the synthesised speech.

10. An apparatus for synthesis of a moving image, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks each
corresponding to the mouth area of the face and representing a
respective different mouth shape;

(c) an audio input for receiving speech signals and frequency
analysis means (10, 11, 12) responsive to such signals to produce
sequences of spectral parameters;

(d) means (13) storing a table relating spectral parameter sequences
to codewords identifying mouth data blocks or sequences thereof;

(e) control means responsive to the said spectral parameters to
select, for output from the table, the corresponding codewords or
codeword sequences.

11. An apparatus according to claim 8, 9 or 10, further including
frame store means (100) for receiving and storing the data
representing one frame of the image;

means (103) for repetitive readout of the frame store to produce a
video signal; and

control means (105) arranged in operation to receive the selected
codewords and to read out, in response to each codeword, the
corresponding mouth data block and to effect insertion of that data
into the data supplied to the readout means (103).

Claims (translated from the German)

1. An apparatus for encoding a moving image including a human face
(5), comprising:

means (1) for receiving video input data;

means for output of data enabling one frame of the image to be
reproduced;



EP 0 225 729 B1



identification means arranged in operation, for each frame of the
image, to identify that part of the input data corresponding to the
mouth (6) of the face represented and

(a) in a first phase of operation to compare the mouth data parts of
each frame with those of other frames in order to select a
representative set (Fig. 3) of mouth data parts, to store the
representative set and to output this set;

(b) in a second phase to compare the mouth data parts of each frame
with those of the stored set and to generate a codeword to be output
indicating which member of the set the mouth data parts of that frame
most closely resemble.

2. An apparatus according to claim 1, in which the identification
means is arranged in operation firstly to identify that part of one
frame of input data which corresponds to the mouth of the face
represented and to identify the mouth part of subsequent frames by
auto-correlation with data of the said one frame.

3. An apparatus according to claim 1 or 2, arranged in operation
during the first phase to store a first mouth data part and then, for
the mouth data parts of each subsequent frame, to compare it with the
first and every other stored mouth data part and, if the result of the
comparison exceeds a threshold value, to store and output it.

4. An apparatus according to claim 1, 2 or 3, in which the comparison
of mouth data is carried out by subtraction of individual picture
element values and summing the absolute values of the differences.

5. An apparatus according to claim 1, 2, 3 or 4, including means for
obtaining the coordinates of the position of the face within
subsequent frames of the image and generating coded data representing
those coordinates.

6. An apparatus according to any one of the preceding claims, in
which during the second phase, in the event that the result of the
comparison between a mouth data part and that one of the set which it
most closely resembles exceeds a predetermined threshold, that data
part is output and stored as a part of the set.

7. An apparatus according to any one of the preceding claims, further
including identification means arranged in operation, for each frame
of the image, to identify that part of the input data which
corresponds to the eyes of the face represented, and

(a) in the first phase of operation to compare the eye data parts of
each frame with those of other frames in order to select a
representative set of eye data parts, to store this representative set
and to output the set;

(b) in the second phase to compare the eye data parts of each frame
with those of the stored set and to generate a codeword indicating
which member of the set the eye data part of that frame most closely
resembles.

8. A speech synthesiser including means for synthesis of a moving
image including a human face, the synthesiser comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
(Fig. 3), each corresponding to the mouth area of the face and
representing a respective different mouth shape;

(c) an input for receiving codes identifying words or parts of words
to be spoken;

(d) speech synthesis means responsive to the codes received at the
input to synthesise words or parts of words corresponding thereto;

(e) means storing a table relating such codes to codewords
identifying the mouth data blocks or sequences of such codewords; and

(f) control means responsive to the codes received at the input to
select the corresponding codeword or codeword sequence from the table
and to output it in synchronism with the synthesis of the
corresponding word or part of a word by the speech synthesis means.

9. A synthesiser according to claim 8, in which the speech synthesis
means includes means arranged in operation to process and queue the
input codes, the queue containing flag codes which indicate changes in
mouth shape, and responsive to each flag code to send an indication to
the control means, after the speech synthesiser has generated the
speech represented by the input code preceding that flag code in the
queue, whereby the control means may synchronise the codeword output
to the synthesised speech.

10. An apparatus for synthesis of a moving image, the apparatus
comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks each
corresponding to the mouth area of the face and representing a
respective different mouth shape;

(c) an audio input for receiving speech signals and frequency
analysis means (10, 11, 12) responsive to such signals to produce
sequences of spectral parameters;

(d) means (13) storing a table relating spectral parameter sequences
to codewords, whereby mouth data blocks or sequences thereof are
identified;

(e) control means responsive to the spectral parameters to select for
output the corresponding codewords or codeword sequences from the
table.

Fig. 5.